
    Deep Interactive Region Segmentation and Captioning

    With recent innovations in dense image captioning, it is now possible to describe every object in a scene with a caption, with objects localized by bounding boxes. However, interpreting such output is not trivial because many bounding boxes overlap. Furthermore, current captioning frameworks do not let the user apply personal preferences to exclude areas that are not of interest. In this paper, we propose a novel hybrid deep learning architecture for interactive region segmentation and captioning in which the user can specify an arbitrary region of the image to be processed. To this end, a dedicated Fully Convolutional Network (FCN), named Lyncean FCN (LFCN), is trained on our special training data to isolate the User Intention Region (UIR) as the output of an efficient segmentation. In parallel, a dense image captioning model provides a wide variety of captions for that region. The UIR is then explained with the caption of the best-matching bounding box. To the best of our knowledge, this is the first work to provide such a comprehensive output. Our experiments show the superiority of the proposed approach over state-of-the-art interactive segmentation methods on several well-known datasets. In addition, replacing the bounding boxes with the result of the interactive segmentation leads to a better understanding of the dense image captioning output as well as improved object detection accuracy in terms of Intersection over Union (IoU).
    Comment: 17 pages, 9 figures
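The Intersection over Union metric used to report the accuracy gain above can be stated concretely. A minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) format (function name and box format are illustrative, not taken from the paper):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

The same ratio can be evaluated between a segmentation mask and a box (intersection and union of pixel sets), which is how replacing boxes with segmentations can raise the reported IoU.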

    Photometric stereo for strong specular highlights

    Photometric stereo (PS) is a fundamental technique in computer vision known to produce 3-D shape with high accuracy. The PS setting uses several input images of a static scene taken from the same camera position under varying illumination. The vast majority of studies of this 3-D reconstruction method assume orthographic projection for the camera model and mainly consider the Lambertian reflectance model for how light scatters at surfaces. Consequently, providing reliable PS results for real-world objects remains a challenging task. We address 3-D reconstruction by PS using a more realistic set of assumptions, combining for the first time the complete Blinn-Phong reflectance model with perspective projection. To this end, we compare two different methods of incorporating the perspective projection into our model. Experiments are performed on both synthetic and real-world images. Note that our real-world experiments do not benefit from laboratory conditions. The results show the high potential of our method even for complex real-world applications such as medical endoscopy images, which may contain strong specular highlights.
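The complete Blinn-Phong model referred to above adds a half-vector specular term to the Lambertian diffuse term. A minimal sketch of the reflected intensity at a single surface point (the coefficients kd, ks and the shininess exponent are illustrative parameters, not values from the paper):

```python
import numpy as np

def blinn_phong_intensity(n, l, v, kd=0.7, ks=0.3, shininess=16.0):
    """Reflected intensity under the Blinn-Phong model:
    Lambertian diffuse term plus a specular term based on the
    half-vector h between the light and viewing directions."""
    n = n / np.linalg.norm(n)  # surface normal
    l = l / np.linalg.norm(l)  # direction towards the light
    v = v / np.linalg.norm(v)  # direction towards the camera
    h = (l + v) / np.linalg.norm(l + v)  # half-vector
    diffuse = kd * max(np.dot(n, l), 0.0)
    specular = ks * max(np.dot(n, h), 0.0) ** shininess
    return diffuse + specular
```

The specular term concentrates energy where the normal aligns with the half-vector, which is exactly what makes strong highlights both informative and difficult for Lambertian PS.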

    Linguistic Interpretation of Visual Contents through Deep Learning

    The main part of the research outlined in this thesis is the development of Deep Learning models for the linguistic interpretation of visual contents. This part is split into two research problems: interactive region segmentation and captioning, and selective texture labeling. For the first problem, we proposed a novel hybrid Deep Learning architecture in which the user can specify an arbitrary region of the image to be highlighted and described. The proposed model replaces the bounding-box indications of the standard object localization process with the output of a deep interactive segmentation module to achieve a better understanding of the dense image captioning and to improve the object localization accuracy. The goal of the second part is to establish a bidirectional correlation between a deep texture representation and its linguistic description via a hybrid CNN-RNN model that enables end-to-end learning of selective texture labeling. This novel architecture provides new opportunities to describe, search, and retrieve texture images from their linguistic descriptions. To train such a model, we generated a multi-label texture dataset that covers color, material, and pattern labeling simultaneously. Our contribution to the automatic generation of texture descriptions provides an excellent opportunity to enrich the existing vocabulary of image captioning. Such a conceptual extension can be used for fine-grained captioning in geology, meteorology, and other natural sciences where fine-grained image structures are important for understanding complicated patterns. Apart from Deep Learning technologies, in the final section of the thesis we proposed a novel approach to defining mathematical morphology on color images.
    To this end, we converted the common RGB values of color images into a new biconical color space and then combined two approaches of mathematical morphology to give meaning to the maximum and the minimum of the matrix field data and to formulate our novel strategy.
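The bi-conical color space itself is not specified in this summary, but its general idea — a hue angle with a chroma-like radius that vanishes at both the black and the white apex of the bi-cone — can be sketched with an HSL-based stand-in (the thesis' actual transform may differ; this conversion is only illustrative):

```python
import colorsys

def rgb_to_bicone(r, g, b):
    """Map RGB values in [0, 1] to bi-cone coordinates:
    hue angle in degrees, a chroma-like radius, and a lightness
    along the cone axis.  Illustrative HSL-based stand-in only."""
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    # The radius shrinks towards both apexes of the bi-cone:
    # black (l = 0) and white (l = 1) have radius 0.
    radius = s * (1.0 - abs(2.0 * l - 1.0))
    return h * 360.0, radius, l
```

Ordering colors along the lightness axis of such a bi-cone is one way to give meaning to the maximum and minimum that morphological dilation and erosion require on multi-channel data.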

    Modelling the Energy Consumption of Driving Styles Based on Clustering of GPS Information

    This paper presents a novel approach to distinguishing driving styles with respect to their energy efficiency. A distinctive property of our method is that it relies exclusively on the global positioning system (GPS) logs of drivers. This setting is highly relevant in practice, as such data can easily be acquired. Relying on positional data alone means that all features derived from it will be correlated, so we strive to find a single quantity that allows us to perform the driving-style analysis. To this end, we consider a robust variation of the so-called "jerk" of a movement. We give a detailed analysis showing how this feature relates to a useful model of energy consumption when driving cars, and we show that it outperforms other, more commonly used jerk-based formulations for automated processing. Furthermore, we discuss the handling of noisy, inconsistent, and incomplete data, a notorious problem when dealing with real-world GPS logs. Our strategy relies on agglomerative hierarchical clustering combined with an L-term heuristic to determine the relevant number of clusters. It can easily be implemented and runs quickly even on very large real-world datasets. We analyse the clustering procedure using established quality criteria. Experiments show that our approach is robust against noise and able to discern different driving styles.
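The pipeline described above — one jerk-based feature per driver, followed by agglomerative clustering — can be sketched as follows. The median-of-absolute-jerk summary and the Ward linkage are illustrative choices; the paper's exact robust formulation and its L-term heuristic for choosing the number of clusters are not reproduced here:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def jerk_feature(t, x):
    """Robust jerk feature of a 1-D position trace sampled at times t:
    the jerk is the third derivative of position, estimated by repeated
    finite differences and summarised by the median of its magnitude."""
    v = np.gradient(x, t)   # speed
    a = np.gradient(v, t)   # acceleration
    j = np.gradient(a, t)   # jerk
    return np.median(np.abs(j))

def cluster_drivers(features, n_clusters=2):
    """Agglomerative hierarchical (Ward) clustering of the
    one-dimensional per-driver jerk features."""
    Z = linkage(np.asarray(features, dtype=float).reshape(-1, 1),
                method="ward")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

A constant-speed trace yields a near-zero feature, while an oscillating one yields a large value, so clearly distinct driving styles separate even in this single dimension.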